Infrastructure as Code for Data & AI: Tools, Tradeoffs and Best Practices

LESSONS FROM THE FIELD: APPLYING IaC TO REAL-WORLD DATA & AI SYSTEMS

The rise of hyperscale cloud platforms such as AWS, Azure, and GCP, together with the elastic storage and compute infrastructure they provide, has enabled and even encouraged Infrastructure-as-Code (IaC) deployments to become best practice for cloud change management across the software industry. Over time, IaC has become increasingly relevant in data and AI engineering, with tools such as Terraform, CDK, Bicep, and Databricks Asset Bundles being used to configure and deploy cloud storage, metadata, data transformation jobs, pipelines, workflows, and, more recently, even AI agents and chatbots. As a senior data & AI engineer at Data Elephant, I work hands-on with enterprise systems every day, and I have seen first-hand how a lack of IaC, change management, and deployment strategy in a data-focused organization slows teams down, makes maintenance difficult, and makes scaling impossible. In this article I will review what Infrastructure-as-Code is, compare its benefits and drawbacks, explore some of the standard IaC technologies available today, and discuss how and why Data Elephant leverages IaC deployments to empower our clients.

What is Infrastructure-as-Code?

Infrastructure as Code (IaC) is a configuration management and deployment strategy for creating, tracking, updating, and decommissioning resources and services on various cloud platforms. Cloud solution architecture deployments often involve specifying many important and fragile configuration values, such as the level of computing power, number and types of instances, networking rules or storage sizing, to name a few. For example, when deploying a new cloud computing cluster for data processing, the types and sizes of the driver and worker nodes, and the number of each, represent a critical set of configuration values for the performance and cost of the cluster. Improperly setting these values can lead to unnecessarily long wait times and/or high costs for data processing. Finding the right configuration for a full cloud solution is an iterative process, fine-tuned over the lifetime of the project.

There are often important dependencies between these resources as well, which determine the order in which they need to be deployed. For example, configuring SageMaker AI for Machine Learning (ML) on AWS relies on the existence of S3 buckets and paths, Elastic Container Registry repositories, and so on, all of which need to be specified upfront to “spin up” the required services for ML training and hosting.
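As a rough sketch of how such dependencies can be expressed in a declarative tool, the Terraform (HCL) fragment below defines a bucket, a container repository, and a SageMaker model that references both. All resource names, the bucket name, the artifact path, and the `sagemaker_role_arn` variable are hypothetical, and the container image and model artifact are assumed to be uploaded separately.

```hcl
# ARN of an existing IAM role that SageMaker can assume (assumed to exist already).
variable "sagemaker_role_arn" {
  type = string
}

# S3 bucket holding model artifacts. The name is a placeholder.
resource "aws_s3_bucket" "artifacts" {
  bucket = "example-ml-artifacts"
}

# ECR repository holding the inference container image.
resource "aws_ecr_repository" "inference" {
  name = "example-inference"
}

# The model refers to attributes of the bucket and the repository, so Terraform
# infers that both must be created before the model.
resource "aws_sagemaker_model" "example" {
  name               = "example-model"
  execution_role_arn = var.sagemaker_role_arn

  primary_container {
    # An image tagged "latest" and the artifact below are assumed to exist.
    image          = "${aws_ecr_repository.inference.repository_url}:latest"
    model_data_url = "s3://${aws_s3_bucket.artifacts.bucket}/model/model.tar.gz"
  }
}
```

No explicit ordering is written anywhere in this sketch; the deployment order follows from the references between resources.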

Many organizations manage all this configuration directly through the cloud portal or console when creating and managing resources. For example, to create the S3 bucket in AWS, developers may navigate to the S3 service in the AWS browser console and use the buttons on the User Interface (UI) to create the new bucket. There are several reasons why this approach to resource management fails as organizations grow and scale their cloud computing resources. First, if all resources are defined and deployed from the browser UI, all the work to define that configuration has to be repeated from scratch when deploying the solution to a new environment, such as when promoting changes from the Development environment (dev) to the Production environment (prod). Moreover, there is no guarantee that the solution implemented in prod is identical to the one tested in dev, and no simple rollback mechanism when a change fails. Worse still, if a developer unintentionally overwrites some configuration values, the original values may be impossible to recover.

The solution that IaC tools provide is a format or language in which to specify the configuration for all cloud resources in flat configuration files, which act as a blueprint for which cloud resources get configured and deployed. Many IaC systems are declarative, meaning the user provides a desired state for the system, and the tool determines the changes required in the existing cloud environment to match that desired state, then deploys those changes.

Some IaC tools are lower level and/or native to the platform, such as AWS CloudFormation or Azure Resource Manager (ARM)/Bicep. Some are higher level and cloud-agnostic, such as Terraform, which can define and deploy resources to many different platforms (a list of common IaC tools can be found at the end of this article). Most of these systems also provide a means of templating and defining variables which can be reused across all IaC resource configuration files. This allows the same set of resources that make up the solution to be redeployed in the same way across multiple environments, using variables to reference the names of the different resources in each environment.

For illustration, below is an example of a simple Terraform configuration for an S3 bucket, with versioning and default AES256 encryption enabled:
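A minimal sketch of what such a configuration might look like, assuming version 4 or later of the Terraform AWS provider, where versioning and encryption are configured as separate resources. The resource label `example` and the bucket name are placeholders.

```hcl
# The bucket itself. Bucket names must be globally unique; this one is a placeholder.
resource "aws_s3_bucket" "example" {
  bucket = "example-data-bucket"
}

# Keep previous versions of objects when they are overwritten or deleted.
resource "aws_s3_bucket_versioning" "example" {
  bucket = aws_s3_bucket.example.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt new objects at rest by default using AES256 (SSE-S3).
resource "aws_s3_bucket_server_side_encryption_configuration" "example" {
  bucket = aws_s3_bucket.example.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
```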

Terraform Configuration for an AWS S3 Bucket, with versioning and encryption enabled

Benefits of Infrastructure as Code

One-Click promotions from Dev -> Staging -> Prod environments: To promote code/infrastructure changes from dev to staging, and then from staging to prod, there is no need to painstakingly set up the exact same infrastructure components in a new environment. Simply use variables for all resource names and ids, then run the deployment for the new environment. The solution deployed in the new environment is consistent with the previous one by construction, since both are defined from the same template.
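As an illustration of this pattern in Terraform, the sketch below parameterizes a bucket by environment. The variable name `environment`, the resource label, and the naming scheme are assumptions made for the example, and each environment is assumed to keep its own separate state.

```hcl
# Target environment, supplied at deployment time.
variable "environment" {
  type        = string
  description = "Target environment: dev, staging, or prod."
}

# The same definition yields a differently named bucket in each environment.
resource "aws_s3_bucket" "raw" {
  bucket = "example-raw-data-${var.environment}"

  tags = {
    Environment = var.environment
  }
}
```

Promoting to staging then amounts to running the same deployment with a different value, for example `terraform apply -var="environment=staging"`.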

Provides complete & accurate system and dependency descriptions: Because configuration for all resources in the solution is defined in code, and deployed from that code, the full set of IaC configuration files provides a comprehensive system description with all resources and associated configurations included. These files also record every reference between infrastructure components in the solution, providing an explicit description of all dependencies across cloud resources and services.

Consumable by LLM/GenAI tools: Another benefit of having the system architecture defined in code using an IaC deployment tool is that you can pass the configuration files to AI systems and coding agents to help actively maintain your architecture over time, or to generate readmes, visuals, and documentation. This would not be possible if all the configuration was defined and captured only within the console.

Minimize costs by cleaning up unused resources: Stale, test, and temporary resources that are managed through IaC are not left lingering in the cloud account, potentially incurring unnecessary costs. Each one is either consciously kept, or removed from the solution by deleting its definition from the code repository and rerunning the deployment. Because the resource no longer exists in the configuration, it is deleted from the cloud account.

Version control: Configuration files can be included in the project code base and tracked by version control systems like Git. This opens up all the benefits of tracking code with version control such as versioning, branching, and pull requests. Here are a few of the benefits of using version control to track IaC files:

Collaboration: Developers can easily push and pull each other’s changes without copy-pasting or duplicating code. Code review, pull request, and merge conflict resolution tools are built right in.

Versioning: Using branches or tagged versions, developers can create different versions of the infrastructure to easily deploy and destroy for experimentation and testing. Different versions can then be easily reintegrated using branch merges.

Rollbacks: In case there is an issue with a prod deployment that was not caught during testing, having the infrastructure deployed from a version control repository allows for quick and simple rollbacks by just reverting any code changes and re-running the deployment.

Deployment security: Allow prod deployments to run only from main or staging branches, so that developers cannot directly apply (potentially breaking) changes to the prod account. To deploy their changes to prod, developers need to submit a pull request, get a code review, and integrate their code changes first.

Enabling Continuous Integration, Continuous Deployment (CI/CD), and DevOps: These deployments can be integrated as a step in a DevOps pipeline to implement automated, repeatable infrastructure deployments coming exclusively from a trustworthy, secure code branch. This turns data infrastructure changes into the same controlled, repeatable procedures as application code releases.

Previewable changes: Most IaC tools let you generate a “plan” that shows exactly what will be created/changed/destroyed before any changes are made. This makes infrastructure changes reviewable and reduces the risk of accidental destructive updates.

Standardization and reuse: IaC often supports deploying pre-defined, shared building blocks of multiple cloud resources (called modules or templates) at once for common patterns like networks, storage, identity, logging, and compute.
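In Terraform, for instance, such a building block is called a module. The sketch below calls one hypothetical local module twice; the module path and its `name` and `environment` inputs are assumptions for illustration, not an existing published module.

```hcl
# Each call deploys the full set of resources defined inside the module.
module "raw_zone" {
  source = "./modules/data-lake-bucket" # hypothetical local module

  name        = "raw"
  environment = "dev"
}

# Reusing the same module keeps both zones configured to the same standard.
module "curated_zone" {
  source = "./modules/data-lake-bucket"

  name        = "curated"
  environment = "dev"
}
```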

Policy, compliance, and guardrails as code: IaC makes it much easier to enforce standards automatically (encryption required, private networking, tagging rules, approved regions/SKUs, restricted public access, etc.).
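A few of these guardrails can be sketched directly in Terraform configuration. The approved-region list, tag values, and resource names below are assumptions chosen for the example; organization-wide enforcement usually relies on additional policy tooling that is outside this sketch.

```hcl
# Approved regions: the plan fails if any other region is supplied.
variable "region" {
  type        = string
  description = "AWS region to deploy into."

  validation {
    condition     = contains(["us-east-1", "ca-central-1"], var.region)
    error_message = "The region must be one of the approved regions."
  }
}

# Tagging rule: every taggable resource created through this provider gets these tags.
provider "aws" {
  region = var.region

  default_tags {
    tags = {
      ManagedBy = "terraform"
    }
  }
}

# Placeholder bucket used to demonstrate the public access guardrail.
resource "aws_s3_bucket" "guarded" {
  bucket = "example-guarded-bucket"
}

# Restricted public access: block public ACLs and public bucket policies.
resource "aws_s3_bucket_public_access_block" "guarded" {
  bucket = aws_s3_bucket.guarded.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```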

Drawbacks of Infrastructure-as-Code

No console handholding: IaC tools require developers to understand the implications of configuring the underlying cloud services in a particular way, and how those configured services will interact and/or which resources will be created as dependencies. The guided UI experience, including setup wizards and helpful warnings you often get from the console when configuring services manually, is not present when using these tools.

State drift: Most IaC tools track a current state of everything that is already deployed, and real-world environments can drift when changes are made outside of IaC (manual console edits, emergency hotfixes, one-off scripts). Teams need discipline and good processes to keep IaC as the single source of truth, and to reconcile drift when it happens. In practice, this usually means that any manual changes made in the console are simply wiped out and overwritten with the next IaC deployment.

Secrets Management Strategy: It is unacceptable from a security perspective to include service account passwords and/or API keys in a code repository. Therefore, when using IaC to deploy services that store usernames, passwords, and keys (such as Azure Key Vault or AWS Secrets Manager), a secrets management strategy is required. There are two common solutions to this problem. The first is to use the secrets management features of the CI/CD deployment tool, injecting secret values directly at deployment time so they are not exposed in code. The second is to deploy the secret with a “CHANGE-ME” value, then manually update the value once deployed.
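Both approaches can be sketched in Terraform against AWS Secrets Manager. The secret names and the `api_key` variable are hypothetical; in the first approach the value is assumed to be supplied by the CI/CD tool, for example through the `TF_VAR_api_key` environment variable.

```hcl
# Approach 1: the value is injected at deployment time and never committed.
variable "api_key" {
  type      = string
  sensitive = true # redacts the value in plan and apply output
}

resource "aws_secretsmanager_secret" "api_key" {
  name = "example/api-key"
}

resource "aws_secretsmanager_secret_version" "api_key" {
  secret_id     = aws_secretsmanager_secret.api_key.id
  secret_string = var.api_key
}

# Approach 2: deploy a placeholder, then set the real value manually.
resource "aws_secretsmanager_secret" "service_password" {
  name = "example/service-password"
}

resource "aws_secretsmanager_secret_version" "service_password" {
  secret_id     = aws_secretsmanager_secret.service_password.id
  secret_string = "CHANGE-ME"

  lifecycle {
    # Tell Terraform not to manage the value after the initial creation.
    ignore_changes = [secret_string]
  }
}
```

Note that with the first approach the injected value is still written to the Terraform state, so the state backend itself needs to be encrypted and access-controlled.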

Based on these arguments, we see that although leveraging IaC deployments does require some organizational/procedural maturity, and some upfront understanding and implementation, the benefits of utilizing an IaC strategy for data platform deployments heavily outweigh the drawbacks. For the reasons above, Data Elephant always recommends developing and implementing an IaC deployment strategy as part of a client organization’s overall data platform strategy.

Data Elephant’s Infrastructure-as-Code Strategy

IaC is considered standard best practice at Data Elephant. We often work with clients without an existing IaC deployment strategy to develop one as part of their overall data platform maturity. This often includes developing a project roadmap to implement those changes, both technically and within the organization. We never bring in pre-built solutions, however. We tailor each solution to the needs of the client, so although we believe in and advocate for IaC as standard practice, if clients prefer not to use it, we are able to work flexibly with or without it.

Because we are cloud-agnostic and provide services across many different cloud data platforms, for net-new IaC implementations we lean towards higher-level tools which are also cloud-agnostic, Terraform in particular. For clients set up on Databricks, we also help set up IaC deployments of jobs and workflows using Databricks Asset Bundles. For clients with multiple cloud accounts for dev, staging, prod, etc. (also considered standard best practice at Data Elephant), we leverage Terragrunt, a Terraform wrapper which simplifies multi-account deployments.
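To give a sense of what a Terragrunt setup can look like, below is a minimal sketch of one environment's configuration file. The folder layout, the shared `root.hcl` file, the module path, and the `environment` input are all assumptions for illustration and do not describe a specific client setup.

```hcl
# live/dev/data-lake/terragrunt.hcl (hypothetical path)

# Pull in shared settings, such as remote state configuration, from a
# root.hcl file located in a parent folder.
include "root" {
  path = find_in_parent_folders("root.hcl")
}

# The Terraform module to deploy for this unit. The path is a placeholder.
terraform {
  source = "../../../modules/data-lake"
}

# Values passed to the module's input variables for this environment.
inputs = {
  environment = "dev"
}
```

A sibling folder such as `live/prod/data-lake` would hold the same file with different inputs, so each account reuses the same module without duplicating its definition.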

Current Standard Infrastructure-as-Code Technologies

For completeness, the following is a list of some of the standard cloud IaC tools commonly used today:

  • Terraform (or OpenTofu): A cloud-agnostic tool where infrastructure is configured in the HashiCorp Configuration Language (HCL) and applied consistently across environments. Commonly used as the default baseline for multi-cloud deployments.

  • Terragrunt: A thin wrapper around Terraform/OpenTofu that reduces duplication and simplifies multi-environment and multi-account deployments.

  • CloudFormation: AWS’s native template-based IaC system for provisioning AWS resources. It’s tightly integrated with AWS services and follows AWS’s deployment model closely.

  • AWS CDK: An AWS-native framework where infrastructure is defined using well-known programming languages and synthesized to CloudFormation. Helpful when you want reusable abstractions beyond raw templates, or prefer to configure resources in a common imperative programming language like Python.

  • Azure Bicep: Azure’s modern IaC language that compiles to Azure Resource Manager (ARM) templates, with a cleaner authoring experience than ARM JSON. A strong default for Azure-first organizations.

  • Databricks Asset Bundles: A Databricks CLI-based way to package and deploy resources like jobs, pipelines, and workflows as code across Databricks environments.

  • Pulumi: A multi-cloud IaC tool that provisions infrastructure using general-purpose programming languages. Often chosen when teams want infrastructure definitions to look and behave like application code.
